SQuId: Measuring Speech Naturalness in Many Languages
Much of text-to-speech research relies on human evaluation, which incurs
heavy costs and slows down the development process. The problem is particularly
acute in heavily multilingual applications, where recruiting and polling judges
can take weeks. We introduce SQuId (Speech Quality Identification), a
multilingual naturalness prediction model trained on over a million ratings and
tested in 65 locales, the largest effort of this type to date. The main insight
is that training one model on many locales consistently outperforms mono-locale
baselines. We present our task, the model, and show that it outperforms a
competitive baseline based on w2v-BERT and VoiceMOS by 50.0%. We then
demonstrate the effectiveness of cross-locale transfer during fine-tuning and
highlight its effect on zero-shot locales, i.e., locales for which there is no
fine-tuning data. Through a series of analyses, we highlight the role of
non-linguistic effects such as sound artifacts in cross-locale transfer.
Finally, we present the effect of our design decisions, e.g., model size,
pre-training diversity, and language rebalancing, with several ablation
experiments.
Comment: Accepted at ICASSP 2023, with additional material in the appendix
Evaluating the Cross-Lingual Effectiveness of Massively Multilingual Neural Machine Translation
The recently proposed massively multilingual neural machine translation (NMT)
system has been shown to be capable of translating over 100 languages to and
from English within a single model. Its improved translation performance on low
resource languages hints at potential cross-lingual transfer capability for
downstream tasks. In this paper, we evaluate the cross-lingual effectiveness of
representations from the encoder of a massively multilingual NMT model on 5
downstream classification and sequence labeling tasks covering a diverse set of
over 50 languages. We compare against a strong baseline, multilingual BERT
(mBERT), in different cross-lingual transfer learning scenarios and show gains
in zero-shot transfer in 4 out of these 5 tasks.
Multimodal Modeling For Spoken Language Identification
Spoken language identification refers to the task of automatically predicting
the spoken language in a given utterance. Conventionally, it is modeled as a
speech-based language identification task. Prior techniques have been
constrained to a single modality; however, in the case of video data there is a
wealth of other metadata that may be beneficial for this task. In this work, we
propose MuSeLI, a Multimodal Spoken Language Identification method, which
delves into the use of various metadata sources to enhance language
identification. Our study reveals that metadata such as video title,
description and geographic location provide substantial information to identify
the spoken language of the multimedia recording. We conduct experiments using
two diverse public datasets of YouTube videos, and obtain state-of-the-art
results on the language identification task. We additionally conduct an
ablation study that describes the distinct contribution of each modality for
language recognition.
Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages
We introduce the Universal Speech Model (USM), a single large model that
performs automatic speech recognition (ASR) across 100+ languages. This is
achieved by pre-training the encoder of the model on a large unlabeled
multilingual dataset of 12 million (M) hours spanning over 300 languages, and
fine-tuning on a smaller labeled dataset. We use multilingual pre-training with
random-projection quantization and speech-text modality matching to achieve
state-of-the-art performance on downstream multilingual ASR and speech-to-text
translation tasks. We also demonstrate that despite using a labeled training
set one-seventh the size of that used for the Whisper model, our model exhibits
comparable or better performance on both in-domain and out-of-domain speech
recognition tasks across many languages.
Comment: 20 pages, 7 figures, 8 tables
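The random-projection quantization mentioned above turns continuous speech frames into discrete pre-training targets by projecting them through a frozen random matrix and snapping to the nearest entry of a frozen random codebook (in the style of BEST-RQ). A minimal sketch; all dimensions and the use of NumPy here are illustrative choices of ours, not USM's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, codebook_size, code_dim = 80, 16, 8  # toy sizes, not USM's

# Both matrices are sampled once and never trained.
proj = rng.normal(size=(feat_dim, code_dim))        # frozen random projection
codebook = rng.normal(size=(codebook_size, code_dim))  # frozen random codebook

def quantize(frames):
    """Map each speech frame (row of `frames`) to the index of the
    nearest codebook vector after the fixed random projection,
    yielding discrete targets for self-supervised pre-training."""
    z = frames @ proj                                        # (T, code_dim)
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (T, K)
    return dists.argmin(axis=1)                              # (T,) int targets
```

Because the quantizer is random and frozen, the encoder (not the targets) must do all the representational work, which keeps the pre-training objective cheap and stable.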
Building an English-Iraqi Arabic Machine Translation System for Spoken Utterances with Limited Resources
This paper presents an English-Iraqi Arabic speech-to-speech statistical machine translation system built with limited resources. We explore the constraints involved, describe how we endeavored to mitigate problems such as a non-standard orthography and a highly inflected grammar, and discuss leveraging the plentiful existing resources for Modern Standard Arabic to assist in this task. These combined techniques reduce unknown words at translation time by over 40% and yield a +3.65 increase in BLEU score over a previous state-of-the-art system using the same parallel training corpus of spoken utterances.
Index Terms: speech translation, limited resources, Arabic
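BLEU, the metric cited above, is the geometric mean of clipped n-gram precisions multiplied by a brevity penalty. A minimal single-reference, sentence-level sketch; the smoothing constant is our own illustrative choice, not part of the paper:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of clipped n-gram precisions
    (n = 1..max_n) times a brevity penalty for short hypotheses."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        # Counter intersection clips each n-gram's count by the reference.
        overlap = sum((hyp_ngrams & ref_ngrams).values())
        total = max(sum(hyp_ngrams.values()), 1)
        # Smooth with a tiny floor so a zero precision doesn't zero the score.
        log_precisions.append(math.log(max(overlap, 1e-9) / total))
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(log_precisions) / max_n)
```

Corpus-level BLEU, as reported in papers, pools n-gram counts over all sentences before computing the precisions rather than averaging sentence scores.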
FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech
We introduce FLEURS, the Few-shot Learning Evaluation of Universal
Representations of Speech benchmark. FLEURS is an n-way parallel speech dataset
in 102 languages built on top of the machine translation FLoRes-101 benchmark,
with approximately 12 hours of speech supervision per language. FLEURS can be
used for a variety of speech tasks, including Automatic Speech Recognition
(ASR), Speech Language Identification (Speech LangID), Translation and
Retrieval. In this paper, we provide baselines for the tasks based on
multilingual pre-trained models like mSLAM. The goal of FLEURS is to enable
speech technology in more languages and catalyze research in low-resource
speech understanding.